Opal: In Vivo Based Preservation Framework for Locating Lost Web Pages

Authors

  • Terry L. Harrison
  • Michael L. Nelson
Abstract

OPAL: IN VIVO BASED PRESERVATION FRAMEWORK FOR LOCATING LOST WEB PAGES. Terry L. Harrison, Old Dominion University, 2005. Director: Dr. Michael L. Nelson.

We present Opal, a framework for interactively locating missing web pages (HTTP status code 404). Opal is an example of "in vivo" preservation: harnessing the collective behavior of web archives, commercial search engines, and research projects for the purpose of preservation. Opal servers learn from their experiences and can share their knowledge with other Opal servers using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Using cached copies that can be found on the web, Opal creates lexical signatures, which are then used to search for similar versions of the web page. Using OAI-PMH to facilitate inter-Opal learning is a novel application of the protocol. We present the architecture of the Opal framework, discuss a reference implementation, and present a quantitative analysis indicating that Opal could be effectively deployed.
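The lexical-signature step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a signature is the handful of terms with the highest TF-IDF weight in the cached copy, which is one common way to build lexical signatures; the abstract does not say which weighting Opal uses. The function names, the `corpus` argument used to estimate document frequency, and the default `k=5` are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic words (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(cached_page, corpus, k=5):
    """Return the k terms of cached_page with the highest TF-IDF weight.

    corpus is a hypothetical list of other page texts, used only to
    estimate document frequency (an assumption of this sketch).
    """
    tf = Counter(tokenize(cached_page))
    doc_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(doc_sets) + 1  # the cached page counts as one document
    scores = {}
    for term, count in tf.items():
        df = 1 + sum(term in s for s in doc_sets)  # pages containing the term
        scores[term] = count * math.log(n_docs / df)
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


def signature_query(signature):
    """Join the signature terms into a plain search-engine query string."""
    return " ".join(signature)
```

A signature built this way favors terms that are frequent in the cached copy but rare elsewhere, so submitting `signature_query(...)` to a search engine is more likely to surface a moved or near-duplicate version of the page than a query made of common words.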

Similar resources

Research on discovering deep web entries

Ontology plays an important role in locating domain-specific Deep Web content. This paper therefore presents a novel framework, WFF, for efficiently locating domain-specific Deep Web databases based on focused crawling and ontology, by constructing a Web Page Classifier (WPC), a Form Structure Classifier (FSC), and a Form Content Classifier (FCC) in a hierarchical fashion. Firstly, WPC discovers potential...

Navigating the World Wide Web

Navigation (colloquially known as “surfing”) is the activity of following links and browsing web pages. This is a time intensive activity engaging all web users seeking information. We often get “lost in hyperspace” when we lose the context in which we are browsing, giving rise to the infamous navigation problem. So, in this age of information overload we need navigational assistance to help us...

The Automatic Extraction of Web Information Based on Regular Expression

Based on a search engine, this paper builds a model for Web information retrieval, matching, and structure extraction, and implements an algorithm for locating and automatically extracting multi-web Baidu news information. By deriving a standard mathematical expression for URLs from the search-result URLs and analyzing the DOM tree structure of web pages, this article designed the key tags regular ...

Evaluating the Quality of the Web Pages of Tehran-Based Research Institutes Affiliated with the Ministry of Science, Research and Technology from the Users' Perspective

Evaluating the quality of web pages from the clients' point of view plays a constructive role in their design and development, especially in research centers, since it familiarizes web developers with the clients' perspective and helps them design client-oriented web sites for scientific and research environments. As a model for assessing the quality of web pages, "webQual" attempts to provid...

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, downloading domain-specific web pages is not a simple task, and an unfocused approach often yields undesired results. Several new ideas have therefore been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...

Publication date: 2005